1 Learning Objectives for Session 6

  1. What is the difference between supervised and unsupervised learning?
  2. What is the main objective of clustering?
  3. How does K-Means algorithm work?
    1. Objective
    2. Inputs
    3. Outputs
  4. How do we determine how many clusters we have in the data? (These tools can be applied with any clustering method).
    1. Elbow charts
    2. PCA visualization
    3. Comparing clustering results with different number of clusters
    4. Silhouette analysis

2 Introduction

Whisky (or whiskey, from the Gaelic uisge beatha, ‘water of life’) is technically defined as an alcoholic liquor obtained by distillation of a fermented starchy compound, usually a grain. Although distillation, including that of fermented grain, had been invented by the Chinese at least as far back as the 7th century AD, five centuries before it was introduced or rediscovered by Europeans, it is in Scotland that whisky has developed its greatest diversity. There are currently more than 100 Scotch whisky distilleries, producing over 300 whiskies sold as single malts. This excludes the innumerable blended whiskies made of assemblages of liquors of different qualities or brands. It is interesting that Scotland alone has developed such a diversity of whiskies matured and sold as single malts; there are only a handful of other pure malts in the world, specifically in Ireland, India and Japan. Single malts are well known by amateurs to differ widely in nose, colour, body, palate and finish. We are interested in discovering what the main types of single-malt Scotches are and in what ways they differ. We will use data from a connoisseur’s guide to malt whiskies of Scotland. This guide contains a description of single malts from 86 distilleries of Scotland, which we will use to answer the following questions.

  1. What are the major types of single-malt whiskies that can be recognized? What are their chief characteristics and the best representatives?

  2. What is the geographic component in that classification?

For wine, it is well known that the region in which its grapes were grown has an impact on taste. In fact, the origins of wines are widely used to group them in stores and on restaurant menus, e.g. Bordeaux wine. This has become common practice for whiskies as well, and distilleries even use their geographical location in naming their products, e.g. Highland Single Malt whisky, mimicking wine producers. In this workshop we will test whether geographical region has an impact on taste for whiskies as well.

  1. Do the various categories of characteristics - nose, colour, body, palate and finish - lead to the same clustering?

But how do we obtain a quantitative assessment of similarity from the literary descriptions of taste written by connoisseurs? We will use clustering algorithms, which are designed to tackle such problems.

The “whisky.csv” file provides data on single malt whiskies from 86 different distilleries in Scotland, for Scotch that was 10 years of age or closest to that age. The 86 malt whiskies are scored from 0 to 4 in 12 different taste categories, including sweetness, smokiness and nuttiness.

2.1 Clustering in R

This is the R Markdown document for Sessions 6 and 7 of the AM04 Data Science course. It contains all the functions and libraries we use in class, as well as additional tools that we may not have time to cover. Please make sure you go over this document before coming to the second session, during which you will use the functions in this document to determine clusters in a different data set. There are many questions and alternative implementations embedded in this document to facilitate your learning. Please go through these exercises to reinforce your understanding. This document is organized as follows.

Table of Contents

  • Section 1: Learning Objectives for Session 6
  • Section 2: Introduction
  • Section 3: Load and explore data
  • Section 4: Data Visualization
  • Section 5: How to run K-means clustering
  • Section 6: Visualizing the results of clustering algorithms
  • Section 7: In-Class workshop
  • Section 8: Determining the number of clusters
  • Section 9: Clustering results for Whisky Data
  • Section 10: Learning Objectives for Session 7
  • Section 11: Partitioning around medoids (Session 7)
  • Section 12: Hierarchical Clustering (Session 7)

3 Whisky Data: Data Exploration and preparation

In this session we will use tasting notes for whiskies. Let’s read the data and then take a look at summary statistics.

##Data is in csv format. So I use read.csv function
whisky <- read.csv(file="whisky.csv",header=TRUE)

##let's look at the top 5 rows
head(whisky,5)
##   RowID Distillery Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey
## 1     1  Aberfeldy    2         2     2         0       0     2     1     2
## 2     2   Aberlour    3         3     1         0       0     4     3     2
## 3     3     AnCnoc    1         3     2         0       0     2     0     0
## 4     4     Ardbeg    4         1     4         4       0     0     2     0
## 5     5    Ardmore    2         2     2         0       0     1     1     1
##   Nutty Malty Fruity Floral Postcode       lat     long
## 1     2     2      2      2     PH15 -3.850199 56.62519
## 2     2     3      3      2     AB38 -3.229644 57.46739
## 3     2     2      3      2     AB54 -2.785295 57.44175
## 4     1     2      1      0     PA42 -6.108503 55.64061
## 5     2     3      1      1     AB54 -2.743629 57.35056
##glimpse function is useful to see a series of values in a column
glimpse(whisky)
## Rows: 86
## Columns: 17
## $ RowID      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Distillery <chr> "Aberfeldy", "Aberlour", "AnCnoc", "Ardbeg", "Ardmore", "Ar…
## $ Body       <int> 2, 3, 1, 4, 2, 2, 0, 2, 2, 2, 4, 3, 4, 2, 3, 2, 1, 2, 2, 1,…
## $ Sweetness  <int> 2, 3, 3, 1, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
## $ Smoky      <int> 2, 1, 2, 4, 2, 1, 0, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 3, 2,…
## $ Medicinal  <int> 0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2,…
## $ Tobacco    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Honey      <int> 2, 4, 2, 0, 1, 1, 1, 2, 1, 0, 2, 3, 2, 2, 3, 2, 0, 1, 2, 2,…
## $ Spicy      <int> 1, 3, 0, 2, 1, 1, 1, 1, 0, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2,…
## $ Winey      <int> 2, 2, 0, 0, 1, 1, 0, 2, 0, 0, 3, 1, 0, 0, 1, 1, 1, 2, 1, 1,…
## $ Nutty      <int> 2, 2, 2, 1, 2, 0, 2, 2, 2, 2, 3, 0, 2, 0, 2, 2, 0, 2, 1, 2,…
## $ Malty      <int> 2, 3, 2, 2, 3, 1, 2, 2, 2, 1, 0, 2, 2, 2, 3, 2, 2, 2, 1, 2,…
## $ Fruity     <int> 2, 3, 3, 1, 1, 1, 3, 2, 2, 2, 1, 2, 2, 3, 2, 2, 2, 2, 1, 2,…
## $ Floral     <int> 2, 2, 2, 0, 1, 2, 3, 1, 2, 1, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2,…
## $ Postcode   <chr> " PH15", " AB38", " AB54", " PA42", " AB54", " KA27", " G81…
## $ lat        <dbl> -3.850199, -3.229644, -2.785295, -6.108503, -2.743629, -5.2…
## $ long       <dbl> 56.62519, 57.46739, 57.44175, 55.64061, 57.35056, 55.69915,…
##Finally I use the describe function to get a sense of the distribution of the values in each column
library(Hmisc)
describe(whisky)
## whisky 
## 
##  17  Variables      86  Observations
## --------------------------------------------------------------------------------
## RowID 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       86        0       86        1     43.5       29     5.25     9.50 
##      .25      .50      .75      .90      .95 
##    22.25    43.50    64.75    77.50    81.75 
## 
## lowest :  1  2  3  4  5, highest: 82 83 84 85 86
## --------------------------------------------------------------------------------
## Distillery 
##        n  missing distinct 
##       86        0       86 
## 
## lowest : Aberfeldy    Aberlour     AnCnoc       Ardbeg       Ardmore     
## highest: Tobermory    Tomatin      Tomintoul    Tormore      Tullibardine
## --------------------------------------------------------------------------------
## Body 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        5    0.843     2.07   0.9702 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency      2    19    45    11     9
## Proportion 0.023 0.221 0.523 0.128 0.105
## --------------------------------------------------------------------------------
## Sweetness 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        4    0.826    2.291   0.7488 
##                                   
## Value          1     2     3     4
## Frequency     10    44    29     3
## Proportion 0.116 0.512 0.337 0.035
## --------------------------------------------------------------------------------
## Smoky 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        5    0.822    1.535   0.8651 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency      4    45    28     5     4
## Proportion 0.047 0.523 0.326 0.058 0.047
## --------------------------------------------------------------------------------
## Medicinal 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        5    0.671   0.5465   0.8577 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency     59    15     7     2     3
## Proportion 0.686 0.174 0.081 0.023 0.035
## --------------------------------------------------------------------------------
## Tobacco 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##       86        0        2    0.308       10   0.1163   0.2079 
## 
## --------------------------------------------------------------------------------
## Honey 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        5    0.883    1.244   0.9146 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency     18    33    32     2     1
## Proportion 0.209 0.384 0.372 0.023 0.012
## --------------------------------------------------------------------------------
## Spicy 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        4    0.861    1.384   0.8375 
##                                   
## Value          0     1     2     3
## Frequency     12    33    37     4
## Proportion 0.140 0.384 0.430 0.047
## --------------------------------------------------------------------------------
## Winey 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        5    0.877   0.9767   0.9702 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency     29    37    15     3     2
## Proportion 0.337 0.430 0.174 0.035 0.023
## --------------------------------------------------------------------------------
## Nutty 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        5    0.841    1.465   0.8575 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency     12    27    43     3     1
## Proportion 0.140 0.314 0.500 0.035 0.012
## --------------------------------------------------------------------------------
## Malty 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        4    0.723    1.802   0.6131 
##                                   
## Value          0     1     2     3
## Frequency      2    21    55     8
## Proportion 0.023 0.244 0.640 0.093
## --------------------------------------------------------------------------------
## Fruity 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        4    0.802    1.802   0.7981 
##                                   
## Value          0     1     2     3
## Frequency      6    18    49    13
## Proportion 0.070 0.209 0.570 0.151
## --------------------------------------------------------------------------------
## Floral 
##        n  missing distinct     Info     Mean      Gmd 
##       86        0        5    0.803    1.698   0.8695 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency      9    19    49     7     2
## Proportion 0.105 0.221 0.570 0.081 0.023
## --------------------------------------------------------------------------------
## Postcode 
##        n  missing distinct 
##       86        0       43 
## 
## lowest :  AB30  AB35  AB37  AB38  AB51, highest:  PH26  PH33  PH4   PH7  AB45 
## --------------------------------------------------------------------------------
## lat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       86        0       77        1   -3.847    1.125   -6.125   -6.010 
##      .25      .50      .75      .90      .95 
##   -4.241   -3.344   -3.192   -2.962   -2.754 
## 
## lowest : -6.359364 -6.352166 -6.283806 -6.153129 -6.125946
## highest: -2.743629 -2.648927 -2.479332 -2.468547 -2.316943
## --------------------------------------------------------------------------------
## long 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       86        0       77        1     57.1   0.8446    55.64    55.80 
##      .25      .50      .75      .90      .95 
##    56.64    57.44    57.54    57.65    57.84 
## 
## lowest : 54.85819 55.42900 55.42981 55.62957 55.63561
## highest: 57.84273 58.01365 58.43488 58.96372 58.96701
## --------------------------------------------------------------------------------

Make sure you understand the type and scale of each variable.

Next let’s keep the tasting note columns only and then scale the data.

#### focus only on tasting notes and so ignore the geographical coordinates
whisky_tasting_notes<- whisky %>% select(Body:Floral)
#### scale data: scaling subtracts the column mean from each observation and divides by the column's standard deviation
whisky_tasting_notes<-data.frame(scale(whisky_tasting_notes))

print(mean(whisky_tasting_notes$Floral))
## [1] 9.35408e-17
print(sd(whisky_tasting_notes$Floral))
## [1] 1
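As a quick sanity check that `scale()` does what the comment above describes, we can standardize one column by hand and compare it with the scaled data frame (a small sketch, assuming the unscaled `whisky` data frame is still in memory):

```r
#standardize the Floral column manually: subtract the mean, divide by the sd
manual_floral <- (whisky$Floral - mean(whisky$Floral)) / sd(whisky$Floral)

#this should agree with the scaled column up to numerical precision
all.equal(manual_floral, whisky_tasting_notes$Floral)
```

If the two disagree, the most likely cause is that `whisky_tasting_notes` was re-created or overwritten after scaling.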

4 Data visualization

Let’s also see where all these distilleries are.

#I display the locations of the distilleries on a map. I will use ggmap and Stamen Maps. 
#We will cover the details of this and other methods for visualizing geographical data in the data visualization course. 
#But if you want to display a different location use "https://www.openstreetmap.org" and `export' to find the coordinates of a location.

library(ggmap)

#these are the coordinates I got from openstreetmap.org for Scotland
scotlandOSM <- c(left = -9.146, bottom = 54.426, right = -0.841, top = 59.317) 
#let's get the map for Scotland
mapScotland<-ggmap(get_stamenmap(scotlandOSM, zoom = 8)) 
#now add the locations of the distilleries
#note: in this data set the lat and long columns appear swapped (lat holds longitudes around -3 to -6, long holds latitudes around 55-59), so x = lat, y = long plots correctly
mapScotland+geom_point(aes(x =lat, y = long),data=whisky,col="blue", alpha=0.8, size=1.5)

Let’s look at the distribution of some of the variables. (I assume you already know how to use basic ggplots.)

ggplot(whisky_tasting_notes, aes(Medicinal)) +
  geom_histogram(binwidth=1)

ggplot(whisky_tasting_notes, aes(x = Sweetness)) + 
  geom_histogram(binwidth=1)

ggplot(whisky_tasting_notes, aes(Tobacco)) +
  geom_histogram(binwidth=1)

ggplot(whisky_tasting_notes, aes(Smoky)) +
  geom_histogram(binwidth=1)

Let’s look at correlations.

# I will use the ggcorr plots in GGally package. Something you have used with Nicos.
library("GGally")
whisky_tasting_notes %>% 
  select(Sweetness:Floral) %>% #keep Y variable last
  ggcorr(method = c("pairwise", "pearson"), label_round=2, label = TRUE, angle = -90, max_size = 4,size = 3)

Which variables are most correlated? Based on the data description and the correlation table, are there any other variables you would like to visualize? Are there any other visualizations that might help you explore the data better?

5 K-means clustering

Find clusters using k-means with k=2. I will use the eclust function (part of the factoextra library) throughout the document. The most important options of the kmeans function are k, the number of clusters, and nstart, the number of random starts.

Also check the additional options in the package documentation: https://cran.r-project.org/web/packages/factoextra/factoextra.pdf

library(factoextra)

model_kmeans_2clusters<-eclust(whisky_tasting_notes, "kmeans", k = 2,nstart = 50, graph = FALSE)

#Let's check the components of this object.
summary(model_kmeans_2clusters)
##              Length Class      Mode   
## cluster      86     -none-     numeric
## centers      24     -none-     numeric
## totss         1     -none-     numeric
## withinss      2     -none-     numeric
## tot.withinss  1     -none-     numeric
## betweenss     1     -none-     numeric
## size          2     -none-     numeric
## iter          1     -none-     numeric
## ifault        1     -none-     numeric
## silinfo       3     -none-     list   
## nbclust       1     -none-     numeric
## data         12     data.frame list
#Size of the clusters

model_kmeans_2clusters$size
## [1] 77  9

Take a look at the cluster sizes. What can you conclude?

After the lecture, run kmeans function by setting nstart=100. Examine the results.
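A sketch of that exercise: simply raise nstart and compare the cluster sizes with the nstart = 50 run above (the variable name `model_nstart100` is just for illustration):

```r
#same model, but with 100 random starts; more starts makes the result more stable
model_nstart100 <- eclust(whisky_tasting_notes, "kmeans", k = 2,
                          nstart = 100, graph = FALSE)
model_nstart100$size
```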

# Extract the cluster assignment vector from the kmeans model
# add it to the original data frame
whisky_tasting_notes_withClusters<-mutate(whisky_tasting_notes, cluster = as.factor(model_kmeans_2clusters$cluster))
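Once the cluster labels are attached to the data frame, a quick sanity check is to tabulate them; the counts should match `model_kmeans_2clusters$size` reported above. A small sketch using dplyr's `count`:

```r
#count observations per cluster; should reproduce the sizes (77 and 9)
whisky_tasting_notes_withClusters %>% count(cluster)
```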

6 Visualizing clusters

There are a variety of methods and libraries to visualize clusters. We will look into three different methods.

6.1 Visualizing clusters by variables

First I plot the positions of the points in each cluster vs two variables. Which variables should we use? Can we determine which variables are most important from this and the visualization of the centers in the next section? (Hint: The answer is yes!)

library(ggpubr)
a<-ggplot(whisky_tasting_notes_withClusters, aes(x = Medicinal, y = Sweetness, color =  as.factor(cluster))) +
  geom_jitter()+labs(color = "Cluster")
# Note that geom_jitter adds a small noise to each observation so that we can see overlapping points

b<-ggplot(whisky_tasting_notes_withClusters, aes(x = Medicinal, y = Sweetness, color = as.factor(cluster),size=Smoky)) +
  geom_jitter()+labs(color = "Cluster")

#Let's arrange these visualizations so that they fit in the html file nicely
library(gridExtra)
grid.arrange(a, b, nrow = 2)

Do you observe noticeable differences between clusters? Can you explain the differences in plain English? Which variables (among the ones we considered above) play a more prominent role in determining clusters?

6.2 Visualizing clusters - Cluster centers

Plotting points against pairs of variables is not very helpful when we have several variables, so we need to do something different. First we look at the centers of the clusters.

#Plot centers for k=2

#First generate a new data frame with cluster centers and cluster numbers
cluster_centers<-data.frame(cluster=as.factor(c(1:2)),model_kmeans_2clusters$centers)

#transpose this data frame
cluster_centers_t<-cluster_centers %>% gather(variable,value,-cluster,factor_key = TRUE)

#plot the centers
graphkmeans_2clusters<-ggplot(cluster_centers_t, aes(x = variable, y = value))+
  geom_line(aes(color = cluster, group = cluster), linetype = "dashed", size = 1)+
  geom_point(size = 1, shape = 4)+
  geom_hline(yintercept = 0)+
  theme(text = element_text(size = 10),
        axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle("K-means Centers k=2")

graphkmeans_2clusters

Make sure you understand the details of the code above after the lecture. You will need them next week.

Do you observe noticeable differences between clusters? Can you explain the differences in plain English? Which variables play a more prominent role in determining clusters (among the ones we considered)?

6.3 Visualizing clusters - PCA

If we want to see where individual points from each cluster fall, instead of just cluster centers, we can use components from Principal Component Analysis (PCA). (You will discuss PCA in detail in the first session of your machine learning class.) This is especially helpful to see how well points in different clusters are separated.

fviz_cluster(model_kmeans_2clusters, whisky_tasting_notes, palette = "Set2", ggtheme = theme_minimal())

Do you observe noticeable differences between clusters? Can you explain the differences in plain English?

This function has numerous options. See https://www.rdocumentation.org/packages/factoextra/versions/1.0.5/topics/fviz_cluster

After the lecture, set geom=“point” to add more information in the visualization.

The main goal of clustering is to find actionable clusters. Visualizations help us determine whether the clusters are different enough, and they provide us with insights. Of course, what insights we are interested in depends on our eventual goal.

“What are the major types of single-malt whiskies that can be recognized? What are their chief characteristics and the best representatives?”

Can you answer these questions from the visualizations?

7 Exercise I

25 mins breakout room + 10 mins class discussion

Instructions

The purpose of this exercise is to analyze the results of the K-means clustering method for the whisky data with a different number of clusters.

In this (unassessed) mini-workshop you will use K-means method to determine the clusters with 3 clusters and visualize the results.

You have 20 minutes to complete the workshop. If you have any questions I will be in the main room; please leave your breakout room to ask them. You can return to your breakout room from the main room afterwards. We will discuss your findings after the breakout exercise. I will randomly choose a group to share their results.

Learning objectives

  1. Understand the parameters of the K-means clustering algorithm.
  2. Use different visualization techniques to investigate the results of a clustering algorithm. Specifically,
    1. plot individual variables against clustering results and interpret the results
    2. plot cluster centers and interpret the results
    3. use principal components to see the separation between clusters.


8 Determining the Number of Clusters

Now that we know how to visualize the results of clustering (these tools can be used with any clustering algorithm) and how to interpret the results, we next turn to a different question: How many clusters do we need?

8.1 Elbow chart

The elbow chart is the most popular tool for determining the number of clusters present in the data. Elbow charts plot the total distance from observations to their cluster centers as a function of the number of clusters.

In K-Means the total distance decreases as we increase the number of clusters. In fact, if we chose the number of clusters equal to the number of observations, the total distance would be zero. (Why?) Therefore, there is no optimal stopping point, i.e., we cannot choose the number of clusters that minimizes the total distance.
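You can verify this with a toy example (not the whisky data): when every observation gets its own cluster, each point is its own center, so the total within-cluster distance is zero.

```r
#four distinct points, four clusters: each point becomes its own center
toy <- data.frame(a = c(1, 2, 5, 9), b = c(0, 3, 4, 8))
toy_model <- kmeans(toy, centers = nrow(toy), nstart = 1)
toy_model$tot.withinss
```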

With an elbow chart we try to pick the number of clusters based on how much the total distance decreases as we increase the number of clusters. The assumption is that the rate at which the total distance decreases diminishes as the number of clusters grows, so we try to pick the number of clusters just before the decrease becomes very small.

Elbow charts do not provide a definitive answer as to what the optimal number of clusters is. However, they are a good tool for determining an upper bound on the number of clusters. Recall that the most important criteria in determining the number of clusters are interpretability and actionability; elbow charts are used to verify and strengthen our conclusions.

First I demonstrate a detailed way of generating an elbow chart. Below we will see specific functions –part of the factoextra library– built for this purpose. Note that this method can also be used with other clustering algorithms (see below for a few).

library(purrr) #a package for writing succinct for-loops

# Use map_dbl to run K-Means models with varying value of k 
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = whisky_tasting_notes, centers = k,iter.max = 100, nstart = 10)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10 ,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

#Here is a short way of producing the elbow chart using "fviz_nbclust" function. 
fviz_nbclust(whisky_tasting_notes,kmeans, method = "wss")+
  labs(subtitle = "Elbow method")

Do you see an elbow shape in the plot –where the reduction in total distance becomes considerably smaller than previous ones?

8.2 PCA visualization

We can also use PCA again to compare the results of different number of clusters.

# eclust function (part of factoextra) makes it easier to visualize clustering results
model_km2 <- eclust(whisky_tasting_notes, "kmeans", k = 2,nstart = 50, graph = FALSE)
model_km2$size
## [1] 77  9
model_km3 <- eclust(whisky_tasting_notes, "kmeans", k = 3,nstart = 50, graph = FALSE)
model_km3$size
## [1] 42 35  9
model_km4 <- eclust(whisky_tasting_notes, "kmeans", k = 4,nstart = 50, graph = FALSE)
model_km4$size
## [1] 38 26  6 16
model_km5 <- eclust(whisky_tasting_notes, "kmeans", k = 5,nstart = 50, graph = FALSE)
model_km5$size
## [1]  6  6 16 28 30
# plots to compare
#I use the fviz_cluster function which is part of the`factoextra` library
p1 <- fviz_cluster(model_km2, geom = "point", data = whisky_tasting_notes) + ggtitle("k = 2")
p2 <- fviz_cluster(model_km3, geom = "point",  data = whisky_tasting_notes) + ggtitle("k = 3")
p3 <- fviz_cluster(model_km4, geom = "point",  data = whisky_tasting_notes) + ggtitle("k = 4")
p4 <- fviz_cluster(model_km5, geom = "point",  data = whisky_tasting_notes) + ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1, p2,p3,p4, nrow = 2)

8.3 Silhouette analysis

Silhouette analysis provides a succinct graphical representation of how well each object has been classified. The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
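Concretely, for a point i, let a(i) be its mean distance to the other points in its own cluster and b(i) the smallest mean distance to the points of any other cluster; the silhouette width is then s(i) = (b(i) - a(i)) / max(a(i), b(i)). A minimal sketch on toy data (not the whisky data), using the silhouette function from the cluster package:

```r
library(cluster)
set.seed(1)
#two well-separated groups of points
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))
toy_km <- kmeans(toy, centers = 2, nstart = 10)
#silhouette() takes the cluster labels and the pairwise distances
sil <- silhouette(toy_km$cluster, dist(toy))
#well-separated clusters give an average width close to 1
summary(sil)$avg.width
```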

#I use the fviz_silhouette function which is part of the`factoextra` library
s2<-fviz_silhouette(model_km2)+ ggtitle(paste("k = 2", "avg sw=",format(round(model_km2$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1   77          0.31
## 2       2    9          0.25
s3<-fviz_silhouette(model_km3)+ ggtitle(paste("k = 3", "avg sw=",format(round(model_km3$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1   42          0.14
## 2       2   35          0.12
## 3       3    9          0.23
s4<-fviz_silhouette(model_km4)+ ggtitle(paste("k = 4", "avg sw=",format(round(model_km4$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1   38          0.16
## 2       2   26          0.16
## 3       3    6          0.29
## 4       4   16          0.05
s5<-fviz_silhouette(model_km5)+ ggtitle(paste("k = 5", "avg sw=",format(round(model_km5$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1    6          0.28
## 2       2    6          0.11
## 3       3   16          0.08
## 4       4   28          0.14
## 5       5   30          0.17
grid.arrange(s2, s3,s4,s5, nrow = 2)

fviz_nbclust(whisky_tasting_notes, kmeans, method = "silhouette",k.max = 15)+labs(subtitle = "Silhouette method")

Unlike the elbow chart, the average silhouette width does not have to decrease in the number of clusters. What do you observe?

We can also find the points that were not assigned to the nearest cluster.

# Silhouette width of observation
sil <- model_km3$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
##    cluster neighbor   sil_width
## 56       3        1 -0.02845223

8.4 Comparing results with different k’s

Let’s look at the results from k=3, 4 and 5. This will help us see which clusters survive and what clusters emerge. We will use this to verify the stopping point in terms of k.

#Plot centers
#Note that I use a slightly different way of plotting the centers.

#Plot centers for k=3
xa<-data.frame(cluster=as.factor(c(1:3)),model_km3$centers)
xa2k3<-xa %>% gather(variable,value,-cluster,factor_key = TRUE)

graphknn3<-ggplot(xa2k3, aes(x = variable, y = value))+  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+ geom_point(size=1,shape=4)+geom_hline(yintercept=0)+theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),legend.title=element_text(size=5),legend.text = element_text(size=5))+ggtitle("K-means Centers k=3")

#Plot centers for k=4
xa<-data.frame(cluster=as.factor(c(1:4)),model_km4$centers)

xa4<-xa %>% gather(variable,value,-cluster,factor_key = TRUE)
graphknn4<-ggplot(xa4, aes(x = variable, y = value))+  geom_line(aes(color = cluster,group = cluster), linetype = "dashed",size=1)+ geom_point(size=1,shape=4)+geom_hline(yintercept=0)+theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),legend.title=element_text(size=5),legend.text = element_text(size=5))+ggtitle("K-means Centers k=4")

#Plot centers for k=5
xa<-data.frame(cluster=as.factor(c(1:5)),model_km5$centers)

xa2<-xa %>% gather(variable,value,-cluster,factor_key = TRUE)
graphknn5<-ggplot(xa2, aes(x = variable, y = value))+  geom_line(aes(color = cluster,group = cluster), linetype = "dashed",size=1)+ geom_point(size=1,shape=4)+geom_hline(yintercept=0)+theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),legend.title=element_text(size=5),legend.text = element_text(size=5))+ggtitle("K-means Centers k=5")
  

model_km3$size
## [1] 42 35  9
model_km4$size
## [1] 38 26  6 16
model_km5$size
## [1]  6  6 16 28 30
grid.arrange(graphknn3,graphknn4,graphknn5, nrow = 3)

As we add more clusters we discover different clusters of whiskies. By looking at the cluster centers and the number of whiskies in each cluster, try to determine how many clusters we should have in our results. What can you conclude? Also check whether your conclusions are in agreement with the elbow chart and the silhouette analysis.

More specifically answer the following questions:

  1. What clusters emerge when we increase the number of clusters?

  2. Are these new clusters significantly different from the already existing ones?

  3. What clusters remain the same when we increase the number of clusters?

  4. How can we use these graphs to determine which clusters are present in the data?

9 Clusters on the map

Although we may not agree on the number of clusters in this data set, I think we are in a good position to answer the second question:

“What is the geographic component in our classification?”

Let’s set the number of clusters equal to 3 and investigate whether there is any correlation between location and clusters. After class, check whether your conclusions change when you set the number of clusters equal to 4 and 5.

mapScotland+ geom_point(aes(x =lat, y = long,color= as.factor(model_km3$cluster)),data=whisky, alpha=0.8, size=2)+labs(color = "Cluster")+scale_color_hue(l=25)

Does the geographical location play a role in clustering results? Can you suggest another way to check this?
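One simple numeric check is to test whether a location variable differs across clusters, for example with a Kruskal-Wallis test. The sketch below runs on simulated data; `lat` and `cl` are illustrative stand-ins for `whisky$lat` and `as.factor(model_km3$cluster)`:

```r
set.seed(1)
# Two groups of points drawn from different latitude bands
lat <- c(rnorm(40, mean = 57), rnorm(40, mean = 56))
cl  <- factor(rep(1:2, each = 40))  # pretend cluster labels

# A small p-value suggests latitude differs across clusters,
# i.e. geography is associated with cluster membership.
kruskal.test(lat ~ cl)
```

A boxplot of latitude by cluster gives the same information visually.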

10 Learning Objectives for Session 7

  1. How to use multiple cluster methods to determine meaningful clusters
    1. Comparing results of clustering methods
    2. Identifying persistent clusters
  2. How does PAM algorithm work?
    1. Objective
    2. Inputs
    3. Outputs
  3. How does Hierarchical clustering algorithm work?
    1. Objective
    2. Inputs
    3. Outputs

11 Partitioning Around Medoids (PAM) (Session 7)

The PAM method is very similar to K-means. We can implement it using the eclust function. (See the slides for technical details.) Once we obtain the results, we can use the visualization tools in a similar way to what we did above for the K-means method.
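Under the hood, eclust with "pam" relies on the pam function from the cluster package. A minimal standalone call on synthetic data (the names here are illustrative, not part of our analysis) looks like this:

```r
library(cluster)

set.seed(42)
x <- matrix(rnorm(100), ncol = 2)

# Partition the points around 3 medoids
fit <- pam(x, k = 3)

fit$medoids          # the actual observations chosen as cluster centers
table(fit$clustering)  # cluster sizes
```

Unlike K-means centers, each medoid is a real data point, which is why `k2_pam$medoids` below prints rows that exist in the data set.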

#Let's use pam clustering. Again `k` is the number of clusters
k=3
k2_pam <-eclust(whisky_tasting_notes, "pam", k = k, graph = FALSE)
#Let's look at the cluster medoids
k2_pam$medoids
##             Body  Sweetness      Smoky  Medicinal   Tobacco      Honey
## [1,] -0.07498567 -0.4052738  0.5385702 -0.5520139 -0.360623  0.8858842
## [2,] -1.14978021  0.9888682 -0.6193557 -0.5520139 -0.360623 -0.2862087
## [3,]  0.99980888 -0.4052738  1.6964961  2.4781900  2.740735 -1.4583017
##           Spicy       Winey      Nutty      Malty    Fruity     Floral
## [1,]  0.7853832  0.02493226  0.6509243  0.3142208 0.2536114  0.3535903
## [2,] -0.4890122 -1.04715499 -0.5660211  0.3142208 0.2536114  0.3535903
## [3,]  0.7853832 -1.04715499 -0.5660211 -1.2753668 0.2536114 -1.9855456

In a way similar to K-means, we can visualize the results in different ways to determine the number of clusters we will use.

Let’s look at the centers of the clusters.

#First generate a new data frame with cluster medoids and cluster numbers
cluster_medoids<-data.frame(cluster=as.factor(c(1:k)),k2_pam$medoids)

#transpose this data frame
cluster_medoids_t<-cluster_medoids %>% gather(variable,value,-cluster,factor_key = TRUE)

#plot medoids
graphkmeans_3Pam<-ggplot(cluster_medoids_t, aes(x = variable, y = value))+  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+ geom_point(size=1,shape=4)+geom_hline(yintercept=0)+theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1))+ggtitle("Pam Medoids k=3")

graphkmeans_3Pam

#We can also visualize the centers (instead of the medoids) to make a fairer comparison
whisky_tasting_notes_withClusters<-mutate(whisky_tasting_notes, 
                                   cluster = as.factor(k2_pam$cluster))

center_locations <- whisky_tasting_notes_withClusters%>% group_by(cluster) %>% summarize_at(vars(Body:Floral),mean)

#Next I use gather to collect information together
xa2p<- gather(center_locations, key = "variable", value = "value",-cluster,factor_key = TRUE)

#Next I use ggplot to visualize centers
pamcenters<-ggplot(xa2p, aes(x = variable, y = value))+  geom_line(aes(color = cluster,group = cluster), linetype = "dashed",size=1)+ geom_point(size=2,shape=4)+geom_hline(yintercept=0)+ggtitle(paste("PAM Centers k=",k))+labs(fill = "Cluster")+theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),legend.title=element_text(size=5),legend.text = element_text(size=5))+scale_colour_manual(values = c("darkgreen", "orange", "red","blue"))


pamcenters

## Compare it with K-means
## Let's make the K-means graph look similar
graphknn42<-ggplot(xa2k3, aes(x = variable, y = value))+  geom_line(aes(color = cluster,group = cluster), linetype = "dashed",size=1)+ geom_point(size=2,shape=4)+geom_hline(yintercept=0)+theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),legend.title=element_text(size=5),legend.text = element_text(size=5))+ggtitle("K-Means Centers k=3")
graphknn42

Do you observe significant differences between the results under PAM and K-means? After the lecture, set the number of clusters equal to 2 and compare the results of the two methods again.

Let’s visualize the results using PCA.

fviz_cluster(model_km3, data = whisky_tasting_notes) + ggtitle("K-means k = 3")+scale_colour_manual(values = c("darkgreen", "orange", "red","blue"))

fviz_cluster(k2_pam, data = whisky_tasting_notes) + ggtitle("PAM k = 3")+scale_colour_manual(values = c("darkgreen", "orange", "red","blue"))

We can also use the other visualization tools with PAM results.

#Let's look at the elbow chart
fviz_nbclust(whisky_tasting_notes,FUNcluster= cluster::pam, method = "wss")+
  labs(subtitle = "Elbow method")

#Let's look at the silhouette analysis
fviz_silhouette(k2_pam)+ ggtitle(paste("k = 3", "avg sw=",format(round(k2_pam$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1   43          0.09
## 2       2   36          0.14
## 3       3    7          0.30

PAM is helpful for checking the robustness of our conclusions from K-means because it is less sensitive to outliers. Why?
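A one-dimensional sketch shows why: the mean (which K-means uses as a center) is pulled toward an extreme value, while the medoid must be an actual observation and stays near the bulk of the data.

```r
x <- c(1, 2, 3, 4, 100)

# The mean chases the outlier
mean(x)  # 22

# The medoid minimizes the total absolute distance to the other points
# and must itself be one of the observations
medoid <- x[which.min(sapply(x, function(m) sum(abs(x - m))))]
medoid   # 3
```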

12 Hierarchical Clustering

Finding clusters using h(ierarchical) clustering requires a slightly different approach than K-means and PAM:

  1. First find the distances between points.
  2. Then determine how to form the clusters.

#The dist function finds the distances between points
res.dist <- dist(whisky_tasting_notes, method = "euclidean")

Now we find the clusters using H-clustering.

#Let's fit an H-Clustering model. k is the number of clusters.
#hc_method is the agglomeration (linkage) method H-Clustering uses. See slides for more.
# I will set the number of clusters equal to 3, although with H-Clustering we can also cut the tree at a different k after it is built.
# hcut is also part of the factoextra library
res.hc <-  hcut(res.dist, hc_method = "ward.D",k=3)
summary(res.hc)
##             Length Class  Mode     
## merge        170   -none- numeric  
## height        85   -none- numeric  
## order         86   -none- numeric  
## labels         0   -none- NULL     
## method         1   -none- character
## call           3   -none- call     
## dist.method    1   -none- character
## cluster       86   -none- numeric  
## nbclust        1   -none- numeric  
## silinfo        3   -none- list     
## size           3   -none- numeric  
## data        3655   dist   numeric
fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1   28          0.16
## 2       2   51          0.09
## 3       3    7          0.30

#Let's look at the size of the clusters
res.hc$size
## [1] 28 51  7
#Before we look at more sophisticated ways of visualizing the results, let's first use the base plot function with the results. The base method works faster and so it's more appropriate for large trees.
plot(res.hc,hang = -1, cex = 0.5)

For larger data sets there are advanced functions to manage the computational requirements. We will not cover these methods in this class.

12.1 Visualizing results

We can use the same visualizations we used with K-means, but there are also some special visualization tools for H-Clustering. Let’s take a look at these first.

#This visualization tool has many different options, I am only using the basic one below.
fviz_dend(res.hc, cex = 0.5, main="ward.D",lwd = 0.5)

Let’s look at cluster centers.

#First let's find the averages of the variables by cluster
whisky_tasting_notes_withClusters<-mutate(whisky_tasting_notes, 
                                   cluster = as.factor(res.hc$cluster))

center_locations <- whisky_tasting_notes_withClusters%>% group_by(cluster) %>% summarize_at(vars(Body:Floral),mean)

#Next I use gather to collect information together
xa2<- gather(center_locations, key = "variable", value = "value",-cluster,factor_key = TRUE)

#Next I use ggplot to visualize centers
hclust_center<-ggplot(xa2, aes(x = variable, y = value,order=cluster))+  geom_line(aes(color = cluster,group = cluster), linetype = "dashed",size=1)+ geom_point(size=2,shape=4)+geom_hline(yintercept=0)+ggtitle("H-clust K=3")+labs(fill = "Cluster")+scale_colour_manual(values = c("darkgreen", "orange", "red","blue")) 
## Compare it with K-means
hclust_center

graphknn42

Do you observe significant differences between the results under H-clustering and K-means? After the lecture, increase the number of clusters to 5 and compare the results of H-clustering and K-means.

Let’s look at the PCA visualization.

fviz_cluster(res.hc, whisky_tasting_notes,
             palette = "Set2", ggtheme = theme_minimal()) + ggtitle("H-Clust k = 3")+scale_colour_manual(values = c("darkgreen", "orange", "red","blue")) 

#add option geom = "point",

Let’s look at the elbow chart.

fviz_nbclust(whisky_tasting_notes, FUN = hcut, method = "wss")

Let’s look at the Silhouette chart.

fviz_silhouette(res.hc)
##   cluster size ave.sil.width
## 1       1   28          0.16
## 2       2   51          0.09
## 3       3    7          0.30

knitr::knit_exit()